Final Project - Predicting Health Indicators

Author

Leanne Chook, Maeve Grady, Patrick Jones and Elena Spielmann

Introduction

Obesity has reached epidemic proportions in the United States, and affects people of all ages, socioeconomic backgrounds and ethnicities. This imposes substantial economic burdens in the form of productivity losses, strains to the healthcare system and costs.

The prevalence of obesity in the United States was approximately 41.9% in 2017 to 2020, of that, the prevalence of severe obesity also increased to 9.2%. This increasing trend is worrying because obesity can lead to several other health conditions such as heart disease, strokes, type 2 diabetes, and certain types of cancers, all of which are the leading causes of preventable and premature deaths in the country. For these reasons, determining predictors of obesity is a public health priority.

Furthermore, this increasing trend of obesity also has other implications on national security, especially as more individuals become ineligible for the military, therefore, significantly reducing military recruitment numbers.

Motivating question:

What health, social, and economic indicators are most important in predicting obesity?

The main motivating question for this project was to examine obesity rates across the United States and find predictors that can help policymakers make effective decisions to help reduce obesity prevalence. Each of the selected data sets brings in different predictors to the model, covering health, social and economic aspects of obesity. This is relevant to the public policy sphere because the marrying of these different aspects are key to meaningful policy change and effective spending of government funds. Determining important predictors can help policymakers make effective decisions to help reduce obesity prevalence, which is a public health priority.

Outline of project:

We will examine and clean up the available information from datasets downloaded from PLACES and the Food Access Repository Atlas (FARA). We will then conduct exploratory data analysis (EDA) and geospatial analysis to guide our decision for the final models, before creating several predictive models to find the best model with highest predictive power. We will also determine what the most important variables for predicting obesity are for each model.

Setup (loading packages)

Code
#install.packages("httr")
library(httr)
library(sf)
#install.packages("ggiraph")
library(tmap)
library(dotenv)
library(here)
library(readxl)
library(dplyr)
library(tidyverse)
library(tidymodels)
library(themis)
library(rpart.plot)
library(vip)
library(lubridate)
library(rpart)
library(ranger)
library(ggplot2)
library(parsnip)
library(yardstick)
library(sf)
library(janitor) 
library(stringr)
library(RSocrata)
library(stringr)
library(mice)
library(sp)
library(tidyr)
library(caret)
library(patchwork)
library(tidycensus)
library(tigris)
library(purrr)
options(tigris_use_cache = TRUE)
library(RColorBrewer)
library(ggiraph)
library(mapview)

Reading in Data using APIs

This endeavor used both the CDC’s PLACES dataset and the USDA’s FARA dataset. The team attempted to read each dataset in using API query’s (attempts below) but ran into limitations in the APIs and in technical expertise. In the case of the FARA dataset, the API available is hosted through ESRI as an ArcGIS REST service. This means that the query must be read in through an st_read()command, which made it very difficult to troubleshoot query errors. There was limited documentation available for the FARA API.

For the PLACES API, we successfully read in the data only to find that the API was not equipped to deliver data at the census tract level, only the county level. In both cases we opted instead to download data directly from the source websites at https://chronicdata.cdc.gov/browse?q=PLACES%202022 for PLACES and https://www.ers.usda.gov/data-products/food-access-research-atlas/download-the-data/ for the FARA dataset.

After joining these two datasets, we also read in census tract shapefiles using the census API and library(tidycensus).

Code
# | eval: false
# accessing FARA data through ArcGIS REST API
## to write this code I consulted this blog post: https://community.esri.com/t5/gis-blog/accessing-arcgis-rest-services-using-r/ba-p/898451

url <- parse_url("https://gis.ers.usda.gov/arcgis/rest/services")
url$path <- paste(url$path, "foodaccess2019/MapServer/0/query", sep = "/")
url$query <- list(returnGeometry = "true",
                  f = "geojson",
                  outFields = "*")

request <- build_url(url)

foodaccess <- st_read(request) 
Reading layer `GeoJSONSeq' from data source 
  `https://gis.ers.usda.gov/arcgis/rest/services/foodaccess2019/MapServer/0/query?returnGeometry=true&f=geojson&outFields=%2A' 
  using driver `GeoJSONSeq'
Simple feature collection with 1 feature and 0 fields (with 1 geometry empty)
Geometry type: GEOMETRYCOLLECTION
Dimension:     XY
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
Code
## using this api to try to figure out how the query should work:
## https://gis.ers.usda.gov/arcgis/rest/services/foodaccess2019/MapServer/0/query?where=&text=&objectIds=&time=&geometry=&geometryType=esriGeometryPolygon&inSR=&spatialRel=esriSpatialRelIntersects&distance=&units=esriSRUnit_Foot&relationParam=&outFields=*&returnGeometry=true&returnTrueCurves=false&maxAllowableOffset=&geometryPrecision=&outSR=&havingClause=&returnIdsOnly=false&returnCountOnly=false&orderByFields=&groupByFieldsForStatistics=&outStatistics=&returnZ=false&returnM=false&gdbVersion=&historicMoment=&returnDistinctValues=false&resultOffset=&resultRecordCount=&returnExtentOnly=false&datumTransformation=&parameterValues=&rangeValues=&quantizationParameters=&featureEncoding=esriDefault&f=html
Code
#install.packages("RSocrata")
library(RSocrata) ## the Rsocrata package has to be loaded in after httr package for the above code to work 

## reading in api credentials
load_dot_env(here(".env"))
app_token <- Sys.getenv("PLACES_app_token")
user_name <- Sys.getenv("PLACES_username")
password <- Sys.getenv("PLACES_password")

# PLACES dataset

places <- read.socrata(
  "https://chronicdata.cdc.gov/resource/swc5-untb.json",
  app_token = paste(app_token),
  email     = paste(user_name),
  password  = paste(password)
)

Combining the Datasets

The federal government provides a wide variety of publicly available health outcomes data. Our group explored a variety of different sources, but ultimately settled on two datasets: the USDA’s The Food Access Research Atlas and the CDC’s PLACES database.

The Food Access Research Atlas (FARA) provides a variety of food access measures for low income and low access census tracts. Food access measures use income, transportation, and distance from grocery stores and other food sellers to determine how accessible food is to residents of a given census tract. This dataset includes distinct measures of access for urban and rural populations.

PLACES provides a variety of health outcome measures including obesity, arthritis, and diabetes, as well as health behaviors such as sleep and smoking. The data in places is largely drawn from the CDC’s Behavioral Risk Factor Surveillance system survey and the National Survey of Children’s health. Data from PLACES is organized at the census tract level.

Our team downloaded both data files directly from the CDC/USDA websites. Because we worked with two separate datasets, our team had to merge these two datasets together before conducting exploratory data analysis and modeling. The USDA organizes the FARA dataset in the tidy format, the dataset needed minimal cleaning and manipulation prior to merging. On the other hand, the CDC does not organize the PLACES dataset in the tidy format. In its original format, PLACES lists data in the long format, with each variable individually listed for each census tract. Our team pivoted this data to the wide format, giving each variable its own column. We also dropped a number of variables that were not relevant to our analysis, such as data source, footnotes, and upper and lower limit estimates. After converting the PLACES dataset, our team successfully merged both datasets together.

Our team wanted to perform geospatial exploratory data analysis in addition to standard EDA. To accomplish this, we pulled census tract shape files from the CDC using their publicly available API. We merged this shapefile to our combined dataset to enable geospatial EDA, we set this combined dataset’s CRS to 4326.

Code
# load the FARA dataset 
fara <- read_excel("fara_2019.xlsx")

# clean the variable names 
fara <- clean_names(fara)

# load the PLACES dataset
places <- st_read("PLACES_ Local Data for Better Health, Census Tract Data 2022 release.geojson") %>% 
  st_transform(value = 4326)
Reading layer `PLACES_ Local Data for Better Health, Census Tract Data 2022 release' from data source `/Users/leannechook/Desktop/school_documents/gradschool/spring_2023/data_science/assignments/ppol670_finalproject/PLACES_ Local Data for Better Health, Census Tract Data 2022 release.geojson' 
  using driver `GeoJSON'
Simple feature collection with 2161543 features and 22 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -173.0177 ymin: 19.25282 xmax: -67.00993 ymax: 71.23014
Geodetic CRS:  WGS 84
Code
places <- places %>%
  rename(census_tract = locationid)

# creating tidy dataset
places_tidy <- places

# Filtering observations to 2020 only
places_tidy <- places %>%
  filter(year == 2020)

# There are a number of columns that are not necessary, they need to be dropped as well
places_tidy <- places_tidy %>%
  dplyr::select(-(c(statedesc, datasource, category, measure, data_value_unit,
                    data_value_type, low_confidence_limit, high_confidence_limit,
                    categoryid, datavaluetypeid, short_question_text, countyfips, 
                    locationname, data_value_footnote, data_value_footnote_symbol,
  )))

# Finally, we can pivot this dataset to wide format
places_tidy <- places_tidy %>%
  pivot_wider(names_from = "measureid",
              values_from = "data_value")

# merge the data frames by census tract
# swapped places and fara so that the combined dataset only has census tracts that are in fara
combined <- right_join(places_tidy, fara,
                       by = "census_tract")

# Read in the shapefile - opting for tidy census here
dotenv::load_dot_env()

api_key <- Sys.getenv("CENSUS_API_KEY")

options(tigris_use_cache = TRUE)

allstates <- c(state.abb)
allstates
 [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA"
[16] "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ"
[31] "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT"
[46] "VA" "WA" "WV" "WI" "WY"
Code
census <- get_acs(geography = "tract", 
                  variables =  "B19013_001",
                  year = 2019,
                  state = allstates, 
                  geometry = TRUE) 

census <- census %>%
  rename(census_tract = GEOID) %>%
  st_transform(crs = st_crs(4326))

# binding census shapefile to this dataset
combined_nonsf <- combined %>% 
  st_transform(crs = st_crs(4326))

combined_census <- st_join(census, combined_nonsf, join = st_intersects)

Exploratory Data Analysis

Before building models, our team conducted exploratory data analysis in order to identify variables of interest and their relationships with our outcome variable (obesity). An initial glimpse at our dataset revealed that a large number of numeric variables were coded as character variables, making them unsuitable for analysis and modeling. We identified and converted these variables to numeric format, while leaving character variables untouched. We recoded a number of binary variables as numeric rather than factors, as factors interfered with our models.

Code
# A ton of our numeric variables are stored as character variables, they need to be converted to numeric variables
to_convert <- colnames(select_if(combined, is.character))

to_convert <- to_convert[!to_convert %in% c("census_tract", "countyname", "year", "stateabbr", "geometry")]

to_convert <- unlist(to_convert)

combined <- combined %>%
  mutate_at(to_convert, as.numeric)

Comparing Obesity Prevalence with Variables of Interest

We continued exploring our data by creating a number of graphs of the relationships between variables of interest and obesity. The variables of interest that we focused on are: 1) Poverty Rate; 2) Median Family Income; 3) Prevalence of Poor Mental Health; and 4) Frequency of Health Checkups.

To make this process easier, we wrote custom functions to make multiple plots.

Code
#' eda_point documentation
#'
#' @param x An independent numeric variable such as poverty rate
#' @param xlab A human readable title for the x-axis
#' @return A geom_point scatter plot of x vs y by fill
#' @export Does not export
#'
#' @examples
#' 
eda_point <- function(x, xlab) {
  
  plot <- ggplot(data = combined, 
                 aes(x = {{ x }},
                     y = OBESITY)) +
    geom_point(aes(fill = as.factor(urban)),
               color = "white", 
               shape = 21) +
    labs(x = xlab, y = "Obesity Prevalence") +
    scale_fill_discrete(name = NULL,
                        breaks = c(0, 1),
                        labels = c("Rural", "Urban")) +
    theme_minimal()
  
  return(plot)
}

eda_point(poverty_rate, "Poverty Rate")

Code
eda_point(median_family_income, "Median Family Income")

Code
eda_point(MHLTH, "Prevalence of Poor Mental Health")

Code
eda_point(CHECKUP, "Frequency of Health Checkups")

These plots reveal a number of insights about our data. While the average obesity rate is higher in rural areas, urban areas have higher variability of data. Unsurprisingly, as the poverty rate increases, obesity rates increase as well. The next graph reflects this as well, showing that obesity declines as median family income increases. As expected, poor mental health also has a positive correlation with obesity prevalence. Regular health checkups, however, do not have a clear relationship with obesity. Furthermore, there is a much wider dispersion of rural census tracts in this plot as well.

It is also useful to look at data variability in urban and rural areas. While the average obesity rate is higher in rural areas, urban areas have higher variability of data.

Code
ggplot(combined, aes(x = factor(urban), y = OBESITY, fill = factor(urban))) +
  geom_boxplot() +
  labs(x = "Rural/Urban", y = "Obesity Prevalence") +
  scale_fill_manual(values = c("blue", "green")) +
  ggtitle("Distribution of Obesity Prevalence by Rural/Urban Areas")

Summary Statistics by State Exploration by state highlights the importance of exploring these data by census tract instead of state. Exploring by state would be misleading.

Code
# Calculate summary statistics by state
summary_stats <- combined %>%
  group_by(stateabbr) %>%
  summarize(mean_obesity = mean(OBESITY, na.rm = TRUE),
            median_income = median(median_family_income, na.rm = TRUE),
            sd_food_access = sd(la1and10, na.rm = TRUE))

head(summary_stats)
Simple feature collection with 6 features and 4 fields
Geometry type: MULTIPOINT
Dimension:     XY
Bounding box:  xmin: -173.0177 ymin: 30.2377 xmax: -84.99064 ymax: 71.23014
Geodetic CRS:  WGS 84
# A tibble: 6 × 5
  stateabbr mean_obesity median_income sd_food_access                   geometry
  <chr>            <dbl>         <dbl>          <dbl>           <MULTIPOINT [°]>
1 AK                32.1        84524           0.501 ((-156.7348 71.23014), (-…
2 AL                40.3        54489           0.495 ((-85.58971 34.58632), (-…
3 AR                38.0        53672           0.500 ((-90.86605 34.31843), (-…
4 AZ                31.0        62880           0.497 ((-113.382 36.52239), (-1…
5 CA                29.1        78986           0.430 ((-120.5514 41.50297), (-…
6 CO                25.3        80828.          0.489 ((-102.6192 38.09705), (-…
Code
#' eda_bar
#'
#' @param y A variable of interest to be plotted by state 
#' @param ylab A human readable title for the y axis
#'
#' @return
#' @export
#'
#' @examples
#' 
eda_bar <- function(y, ylab){
  plot <- ggplot(summary_stats, 
                 aes(x = stateabbr, 
                     y = {{ y }})) +
    geom_bar(stat = "identity", 
             fill = "dodgerblue") +
    labs(x = "State", 
         y = ylab) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    theme_minimal() +
    coord_flip()
  
  return(plot)
}

eda_bar(mean_obesity, "Mean Obesity")

Code
eda_bar(median_income, "Median Income")

Code
eda_bar(sd_food_access, "Standard Deviation of Food Access")

Based on the graphs above, we can determine generally that southern states have the highest rates of obesity. HI has the lowest rate of obesity. DC has the highest median income and MS has the lowest. Most states have a high degree of variability which suggests that food access varies widely across census tracts within a state.

Spread of obesity prevalence across the United States

Code
combined_census <- combined_census %>% 
  clean_names() %>% 
  filter()

breaks <- c(0, 10, 20, 30, 40, 50, 60, 70)

# binning obesity rates and creating text to work with
combined_census <- combined_census %>%
  mutate(obesity_num = as.numeric(obesity),
         obesity_bin = cut(obesity_num, breaks = 10),
         maptext = paste0("State: ", as.character(stateabbr), "\n",
                          "Tract: ", as.character(census_tract_x), "\n",
                          "Obesity rate: ", as.character(obesity), "\n",
                          "Rate of current smokers: ", as.character(csmoking), "\n",
                          "Rate of teeth lost: ", as.character(teethlost), "\n",
                          "Median family income: ", as.character(median_family_income)))

# plot the graph 
obesity_tracts <-  ggplot(combined_census) +
  geom_sf(aes(fill = obesity_bin),
          lwd = 0,
          color = NA)+
  scale_fill_brewer(palette = "RdYlGn", direction  = -1)+
  labs(title = "Adult obesity rate by census tract in the US",
       fill = "Prevalence of\nObesity") +
  coord_sf(xlim = c(-160, -64),
           ylim = c(18, 72), 
           crs = 4326) +
  theme_void()

obesity_tracts

Code
#this codeblock is an interactive map of obesity by census tract that includes some other information that can pop up when hovering over each tract. While the code runs, it takes a very long time to run so we have chosen not to evaluate it here. 
interactivemapdata <- combined_census %>% 
  select(name, state, census_tract_x, obesity_num, median_family_income, csmoking, teethlost, maptext) 

interactivemapdata %>% mapview(zcol = "obesity_num",
                               map.types = "Esri.NatGeoWorldMap",
                               col.regions = brewer.pal(8, "RdYlGn"),
                               color = NA)

obesity_tracts_int <-  ggplot() +
  geom_sf_interactive(data = combined_census, 
                      aes(fill = obesity_bin,
                          data_id = maptext,
                          tooltip = maptext),
                      lwd = 0,
                      color = NA) +
  
  scale_fill_brewer(palette = "RdYlGn",direction  = -1,  aesthetics = "fill", guide = "none")+
  labs(title = "Adult obesity rate by census tract in the US")+
  coord_sf(xlim=c(-160, -64),ylim=c(-30,72), crs = 4326) +
  theme_void()

girafe(ggobj = obesity_tracts_int) %>%
  girafe_options(opts_hover(css = "fill:blue;"), opts_sizing(rescale = TRUE))

(Supervised) Machine Learning

Setting up the testing environment

Code
# separate geometry column into longitude and latitude, and reduce the number of variables to prepare it for recipe 
combinedsmall <- combined %>%
  select(census_tract, COPD, OBESITY, STROKE, DEPRESSION, LPA, CASTHMA, TEETHLOST, ARTHRITIS, DIABETES, BINGE, SLEEP, ACCESS2, PHLTH, DENTAL, MHLTH, CANCER, CHD, GHLTH, CHECKUP, CSMOKING, CERVICAL, KIDNEY, COLON_SCREEN, COREW, urban, median_family_income, la1and10, lalowi1share, lakids1share, laseniors1share, lahunv1share)


combinedsep <- combinedsmall %>%
  mutate(longitude = st_coordinates(geometry)[, "X"],
         latitude = st_coordinates(geometry)[, "Y"]) 

# to remove the geometry column from dataframe 
combinedsep <- st_drop_geometry(combinedsep)

# convert all variables to numeric
combinedsep$urban <- as.numeric(as.character(combinedsep$urban))

# set seed
set.seed(20230507)

# split the data into training and testing sets 
obesity_split <- initial_split(data = combinedsep, 
                               prop = 0.8)

obesity_train <- training(x = obesity_split)
obesity_test <- testing(x = obesity_split)

# set up v-fold cross validation 
folds <- vfold_cv(data = obesity_train, v = 5, repeats = 1)

# create a recipe 
obesity_rec <- recipe(OBESITY ~., data = obesity_train) %>%
  step_other(census_tract) %>%
  step_naomit(all_predictors()) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors()) %>%
  step_dummy(census_tract)

Models

Linear Regression

This is a simple and widely-used statistical model for predicting a continuous outcome based on one or more predictor variables. The goal is to find a linear relationship between the outcome and predictors, which can be used to make predictions.

Code
# create a linear regression model
lm_mod <- linear_reg() %>%
  set_engine("lm")

# create a workflow 
lm_wf <- workflow() %>%
  add_recipe(obesity_rec) %>%
  add_model(lm_mod) 

# fit the model 
lm_cv <- lm_wf %>%
  fit_resamples(resamples = folds)

# select the best model based on the "rmse" metric
lm_best <- lm_cv %>%
  select_best("rmse")

# finalize workflow
lm_final <- finalize_workflow(
  lm_wf,
  parameters = lm_best
)

# fit to the training data and extract coefficients
lm_coefs <- lm_final %>%
  fit(data = obesity_train) %>%
  extract_fit_parsnip() %>%
  vi(lambda = lasso_best$penalty)

LASSO Model

This is a linear regression model with a regularization term added to the objective function, which helps to prevent overfitting and improve generalization performance. The LASSO penalty shrinks some of the regression coefficients towards zero, effectively selecting a subset of the most important predictors for the outcome.

Code
# create a tuning grid for lasso regularization, varying the regularization penalty
lasso_grid <- grid_regular(penalty(), levels = 10)

# create a linear_regression model to tune the penalty parameter
lasso_mod <- linear_reg(
  penalty = tune(), 
  mixture = 1
) %>%
  set_engine("glmnet")

# create a workflow using your updated linear regression model 
lasso_wf <- workflow() %>%
  add_recipe(obesity_rec) %>%
  add_model(lasso_mod) 

# perform hyperparameter tuning using the lasso_grid and cross validation folds
lasso_cv <- lasso_wf %>%
  tune_grid(
    resamples = folds,
    grid = lasso_grid
  )

# select the best model based on the "rmse" metric
lasso_best <- lasso_cv %>%
  select_best(metric = "rmse")

# finalize workflow and finding the best model
lasso_final <- finalize_workflow(
  lasso_wf,
  parameters = lasso_best
)

# fit to the training data and extract coefficients
lasso_coefs <- lasso_final %>%
  fit(data = obesity_train) %>%
  extract_fit_parsnip() %>%
  vi(lambda = lasso_best$penalty) 

Ridge Model

This is also a linear regression model with a regularization term, but instead of selecting a subset of predictors, it shrinks all of the regression coefficients towards zero. This can help to reduce the impact of multicollinearity among the predictors.

Code
# create a tuning grid for ridge regularization, varying the regularization penalty
ridge_grid <- grid_regular(penalty(), levels = 10)

# create a linear_regression model to tune the penalty parameter
ridge_mod <- linear_reg(
  penalty = tune(), 
  mixture = 0) %>%
  set_engine("glmnet")

# create a ridge workflow using your updated linear regression model 
ridge_wf <- workflow() %>%
  add_recipe(obesity_rec) %>%
  add_model(ridge_mod)

# perform hyperparameter tuning using the on ridge hyperparameter grid and use cross_validation folds 
ridge_cv <- ridge_wf %>%
  tune_grid(
    resamples = folds,
    grid = ridge_grid
  )

# select the best model based on the "rmse" metric
ridge_best <- ridge_cv %>%
  select_best(metric = "rmse")

# finalize workflow
ridge_final <- finalize_workflow(
  ridge_wf,
  parameters = ridge_best
)

# fit the final ridge model to the full training data and extract coefficients
ridge_coefs <- ridge_final %>%
  fit(data = obesity_train) %>%
  extract_fit_parsnip() %>%
  vi(lambda = ridge_best$penalty) 

Elastic Net

This is a combination of LASSO and ridge regression, with both penalties included in the objective function. This can help to balance the benefits of variable selection and coefficient shrinkage, and is useful when the data contains many correlated predictors.

Code
# create a tuning grid for elastic net regularization, varying the regularization penalty
elastic_net_grid <- grid_regular(penalty(), levels = 10)

# create a linear_regression model to tune the penalty parameter
elastic_net_mod <- linear_reg(
  penalty = tune(), 
  mixture = 0.5) %>%
  set_engine("glmnet")

# create an elastic net workflow using your updated linear regression model
elastic_net_wf <- workflow() %>%
  add_recipe(obesity_rec) %>%
  add_model(elastic_net_mod)

# perform hyperparameter tuning using the on your elastic net hyperparameter grid and use cross_validation folds 
elastic_net_cv <- elastic_net_wf %>%
  tune_grid(
    resamples = folds,
    grid = elastic_net_grid)

# select the best model based on the "rmse" metric
elastic_net_best <- elastic_net_cv %>%
  select_best(metric = "rmse")

# finalize workflow
elastic_net_final <- finalize_workflow(
  elastic_net_wf,
  parameters = elastic_net_best)

# fit the final elastic net model to the full training data and extract coefficients
elastic_net_coefs <- elastic_net_final %>%
  fit(data = obesity_train) %>%
  extract_fit_parsnip() %>%
  vi(lambda = elastic_net_best$penalty)

Comparing the Different Models

Code
# the models are compared for prediction accuracy
bind_rows(
  `lm` = show_best(lm_cv, metric = "rmse", n = 1),
  `LASSO` = show_best(lasso_cv, metric = "rmse", n = 1),
  `ridge` = show_best(ridge_cv, metric = "rmse",n = 1),
  `enet` = show_best(elastic_net_cv, metric = "rmse", n = 1),
  .id = "model"
)
# A tibble: 4 × 8
  model .metric .estimator  mean     n std_err .config                   penalty
  <chr> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                       <dbl>
1 lm    rmse    standard    2.55     5 0.00754 Preprocessor1_Model1       NA    
2 LASSO rmse    standard    2.55     5 0.00747 Preprocessor1_Model01       1e-10
3 ridge rmse    standard    2.70     5 0.00735 Preprocessor1_Model01       1e-10
4 enet  rmse    standard    2.55     5 0.00749 Preprocessor1_Model01       1e-10
Code
all_coefs <- bind_rows(
  `lm` = lm_coefs,
  `LASSO` = lasso_coefs,
  `ridge` = ridge_coefs,
  `enet` = elastic_net_coefs,
  .id = "model"
) 

all_coefs %>%
  group_by(model) %>%
  slice_max(Importance, n = 10) %>%
  ggplot(aes(Importance, Variable, fill = model)) +
  geom_col(position = "dodge")

Code
all_coefs %>%
  filter(model != "lm") %>%
  group_by(model) %>%
  slice_max(Importance, n = 10) %>%
  ggplot(aes(Importance, Variable, fill = model)) +
  geom_col(position = "dodge")

Code
# compare the regularized coefficients to the lm coefficients for all three models
plot1 <- left_join(
  rename(lm_coefs, lm = Importance),
  rename(lasso_coefs, LASSO = Importance),
  by = "Variable"
) %>%
  ggplot(aes(lm, LASSO)) +
  geom_point(alpha = 0.3)

plot2 <- left_join(
  rename(lm_coefs, lm = Importance),
  rename(ridge_coefs, ridge = Importance),
  by = "Variable"
) %>%
  ggplot(aes(lm, ridge)) +
  geom_point(alpha = 0.3)

plot3 <- left_join(
  rename(lm_coefs, lm = Importance),
  rename(elastic_net_coefs, enet = Importance),
  by = "Variable") %>%
  ggplot(aes(lm, enet)) +
  geom_point(alpha = 0.3)

plot1 + plot2 + plot3

Decision Tree Model

This is a tree model. The first model is not tuned and the second model is tuned using the same cross-folds as the above models, but does not show an improvement in rmse compared to the non-tuned model. This model was interesting to examine to understand what factors the model considers as driving obesity rates more than others. As you can see in the variable importance plot below, current smoking rates, lost teeth rates, “general health”, low rates of no liesure-time physical activity and physical health as relatively important variables in predicting obesity rates. Median family income was also in the top 10 of important variables.

Code
# create a model
tree_mod <- 
  decision_tree() %>%
  set_engine(engine = "rpart") %>%
  set_mode(mode = "regression")

# create a workflow
tree_wf <- workflow() %>%
  add_recipe(obesity_rec) %>%
  add_model(tree_mod)

# fit model to the training set 
tree_fit <- tree_wf %>%
  fit(data = obesity_train)

# create a tree 
rpart.plot::rpart.plot(x = tree_fit$fit$fit$fit, 
                       roundint = FALSE)

Code
# make predictions on train data
predictions <- bind_cols(
  obesity_train,
  predict(object = tree_fit, 
          new_data = obesity_train)
)

pred_tree <- predict(object = tree_fit, 
                     new_data = obesity_train) 

# evaluating model
predictions <- predictions %>% 
  drop_na(OBESITY, .pred) %>% 
  mutate(
    error = .pred - OBESITY,
    sqerror = error^2
  )

# calculating rmse on non-tuned model
rmse_tree <- sqrt(mean(predictions$sqerror))
rmse_tree
[1] 3.930356
Code
# variable importance
tree_fit %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 10)

Code
# tuning tree
gridval <- 10

tree_res <- tree_wf %>%
  tune_grid(resample = folds, 
            grid = gridval,
            control = control_grid(save_pred = TRUE))

tree_best <- tree_res %>%
  select_best(metric = "rmse")

tree_tuned <- tree_wf %>%
  finalize_workflow(parameters = tree_best)

tree_final <- fit(tree_tuned, data = obesity_train)

# variable importance
tree_final %>% 
  extract_fit_parsnip() %>% 
  vip(num_features = 10)

Code
# predictions with tuned model
newpredictions <- bind_cols(
  obesity_test,
  predict(tree_final, new_data = obesity_test)
)

# evaluating model
newpredictions <- newpredictions %>% 
  drop_na(OBESITY, .pred) %>% 
  mutate(
    error = .pred - OBESITY,
    sqerror = error^2
  )

# calculating rmse on tuned model
rmse_prunedtree <- sqrt(mean(newpredictions$sqerror))
rmse_prunedtree
[1] 3.962811

Results

Best Models

In order to pick the best model, we consider the RMSE as the metric used for evaluation, the mean value of the metric across the folds, and the standard error of the metric. Generally, we want to choose the model that has the lowest value of the metric, as this indicates better performance on the test data.

The RMSE values for lm, LASSO, and enet are all the same at 2.55. However, the ridge model has a slightly higher RMSE of 2.69 and the decision tree model has an rmse of 3.96. This suggests that the lm, LASSO, and enet models may be better choices than the ridge model.

Additionally, we consider the standard error of the metric. The standard error gives an estimate of the precision of the mean estimate, with a lower value indicating more precision. In our case, we can see that the standard errors for all models are quite small, indicating that the mean estimates are likely quite precise.

Most Important Variables

The two most important variables for each model are consistently diabetes and arthritis across the board with regard to obesity. Specifically, people with diabetes and arthritis may be more likely to develop obesity.

Other factors such as income, having children, sleep, mental health, leisure-time physical activity, smoking, and binge drinking also play a role in predicting obesity. This indicates that obesity is a multifaceted issue that is influenced by various lifestyle and health-related factors. Thus, a comprehensive approach is needed to tackle the issue of obesity. Additionally, certain health issues such as cancer, overall poor health, heart disease, and kidney problems may not be predictors of obesity but rather can be consequences of it. This highlights the importance of early prevention and intervention to prevent obesity and its associated health complications.

Evaluating discrepancies: If a variable has a high importance in one model, but not in other models, it means that the other models are able to achieve comparable prediction accuracy without relying heavily on that particular variable. In this case, the variable BINGE (binge drinking) is highly important in the lm model but not in LASSO, ridge, and enet models. This suggests that BINGE may be correlated with other predictors, and the regularization techniques in LASSO, ridge, and enet models are able to effectively deal with this multicollinearity by shrinking the coefficient of BINGE to zero or close to zero.

Therefore, it is likely that BINGE does not contribute much additional information in predicting obesity after accounting for the other predictors in the LASSO, ridge, and enet models. However, the lm model may be overfitting to BINGE due to the absence of regularization, which could explain why it is highly important in that model.